Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach
نویسندگان
چکیده
Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic “someone” tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: firstly, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Moreover, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.
منابع مشابه
Automatic Video Captioning via Multi-channel Sequential Encoding
In this paper, we propose a novel two-stage video captioning framework composed of 1) a multi-channel video encoder and 2) a sentence-generating language decoder. Both of the encoder and decoder are based on recurrent neural networks with long-short-term-memory cells. Our system can take videos of arbitrary lengths as input. Compared with the previous sequence-to-sequence video captioning frame...
متن کاملFrom Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning in essential is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using encoder-decoder framework for video captioning and address what we find to be a critical deficiency of the existing methods, that most of the decoders propagate deterministic hidden st...
متن کاملVideo Captioning via Hierarchical Reinforcement Learning
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the cha...
متن کاملMulti-modal Multi-task Learning for Automatic Dietary Assessment
We investigate the task of automatic dietary assessment: given meal images and descriptions uploaded by real users, our task is to automatically rate the meals and deliver advisory comments for improving users’ diets. To address this practical yet challenging problem, which is multi-modal and multi-task in nature, an end-to-end neural model is proposed. In particular, comprehensive meal represe...
متن کاملTowards an intelligent framework for multimodal affective data analysis
An increasingly large amount of multimodal content is posted on social media websites such as YouTube and Facebook everyday. In order to cope with the growth of such so much multimodal data, there is an urgent need to develop an intelligent multi-modal analysis framework that can effectively extract information from multiple modalities. In this paper, we propose a novel multimodal information e...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2017